CLN: Consolidate decimal string parsing functions in tokenizer.c #62823

heoh · 2025-10-24T18:09:36Z

closes Consolidate decimal digit -> str functions in tokenizer.c #62717
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Description

This PR consolidates two similar string manipulation functions in tokenizer.c:

_str_copy_decimal_str_c: handled decimal/thousands separator replacement with heap allocation
copy_string_without_char: removed characters with caller-provided buffer

The refactored _str_copy_decimal_str_c now combines the strengths of both approaches:

Uses caller-provided stack buffer (no heap allocation needed)
Processes multiple character replacements
Uses block memory operations (strspn, memcpy) instead of byte-by-byte processing
Provides comprehensive validation and error handling
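
To make the list above concrete, here is a minimal sketch of the general technique: writing into a caller-provided buffer, copying digit runs in blocks with strspn and memcpy, and checking capacity before every write. It is not the code from this PR. The name copy_decimal_sketch, its signature, and its return codes are assumptions made for illustration, and it performs no validation of the number format.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, NOT the pandas implementation.
 * Copies `src` into `dst` (capacity `dst_cap`, counting the NUL terminator),
 * dropping every `tsep` and rewriting `decimal` as '.'.
 * Returns 0 on success, -1 if the result would not fit. */
static int copy_decimal_sketch(char *dst, size_t dst_cap, const char *src,
                               char decimal, char tsep) {
  const char digits[] = "0123456789";
  char *out = dst;
  char *const end = dst + dst_cap; /* one past the last writable byte */

  if (dst_cap == 0) {
    return -1;
  }

  while (*src != '\0') {
    /* Block path: measure a run of digits and copy it with one memcpy. */
    size_t run = strspn(src, digits);
    if (run > 0) {
      if ((size_t)(end - out) <= run) {
        return -1; /* must leave room for the terminator */
      }
      memcpy(out, src, run);
      out += run;
      src += run;
      continue;
    }
    /* Thousands separators are skipped without being copied. */
    if (tsep != '\0' && *src == tsep) {
      src++;
      continue;
    }
    /* Any other character is copied one byte at a time;
     * the decimal mark is normalised to '.'. */
    if ((size_t)(end - out) <= 1) {
      return -1;
    }
    *out++ = (*src == decimal) ? '.' : *src;
    src++;
  }
  *out = '\0';
  return 0;
}
```

With decimal ',' and tsep '.', the input "1.234.567,89" is rewritten as "1234567.89" in the caller's buffer, and a buffer that is too small yields -1 instead of an overflow.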

Changes

Refactored _str_copy_decimal_str_c to write to caller-provided buffer instead of heap allocation
Added helper functions and macros (str_consume_span, etc.) for efficient string parsing
Removed copy_string_without_char function
Updated str_to_int64 and str_to_uint64 to use the refactored function
Simplified round_trip function with stack buffer allocation
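
As a rough illustration of the stack-buffer pattern mentioned for round_trip, the sketch below parses a double through a fixed-size local array instead of a malloc/free pair. It is not the actual round_trip code. WORD_CAPACITY_SKETCH is an assumed stand-in for PROCESSED_WORD_CAPACITY (whose real value is not shown here), and parse_double_stack_sketch is a hypothetical name.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed capacity; stands in for PROCESSED_WORD_CAPACITY. */
#define WORD_CAPACITY_SKETCH 128

/* Hypothetical sketch: parse `word` as a double whose decimal mark is
 * `decimal`, using a stack buffer rather than heap allocation.
 * Assumes the "C" locale, where strtod expects '.' as the decimal mark.
 * Returns 0 on success, -1 if the word is too long or not fully numeric. */
static int parse_double_stack_sketch(const char *word, char decimal,
                                     double *out) {
  char buf[WORD_CAPACITY_SKETCH];
  size_t len = strlen(word);

  if (len >= sizeof buf) {
    return -1; /* would not fit together with the terminator */
  }
  memcpy(buf, word, len + 1);

  if (decimal != '.') {
    /* A literal '.' is not a valid decimal mark in this configuration. */
    if (memchr(buf, '.', len) != NULL) {
      return -1;
    }
    char *mark = (char *)memchr(buf, decimal, len);
    if (mark != NULL) {
      *mark = '.';
    }
  }

  char *endptr;
  double value = strtod(buf, &endptr);
  if (endptr == buf || *endptr != '\0') {
    return -1; /* nothing parsed, or trailing characters left over */
  }
  *out = value;
  return 0;
}
```

Because the buffer lives on the stack, every early return is safe without a cleanup path, which is the simplification in memory management the change list refers to.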

Alvaro-Kothe · 2025-10-30T17:26:05Z

pandas/_libs/src/parser/tokenizer.c

-    }
-    p = buffer;
+  char *endptr;
+  int status = _str_copy_decimal_str_c(buffer, PROCESSED_WORD_CAPACITY, p_item,


Previously, the integer parsing functions preprocessed the word only when they detected a tsep in it. Now the preprocessing always runs. To me, this doesn't seem necessary.

heoh · 2025-10-30

I agree that the approach you describe would perform better.

However, even if there is no tsep, processing such as skipping whitespace is still necessary, so the string-handling logic before and after the conversion would have to be restored as it was before.

If I had to go that way, I think it might also be better to use copy_string_without_char as before.

So I think the current version is reasonable: it gives up a little performance in exchange for consolidated, simpler logic.

However, I'm not an expert, so I’d appreciate your advice. If you still think it’s better to make the change even after considering what I said, I’ll follow your recommendation.

Alvaro-Kothe · 2025-10-30

> However, even if there is no tsep, processing such as skipping whitespace is still necessary,

Honestly, I don't see much of a problem in keeping the repeated logic for handling leading and trailing whitespace as it was before.

> If I had to go that way, I think it might also be better to use copy_string_without_char as before.

Considering the goal of the issue was to create a superset of this function, I think it's better to use the function that you are rewriting.

> So I think the current version is reasonable: it gives up a little performance in exchange for consolidated, simpler logic.

This is a valid point. I am ambivalent about this. I would wait for @WillAyd's opinion on this topic.
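
For readers following the trade-off in this thread, here is a hedged sketch of the "only preprocess when a tsep is present" variant. It is not the code before or after this PR. parse_int64_sketch, its buffer size, and its return codes are assumptions for illustration only.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of conditional preprocessing.
 * Parses a base-10 int64 from `word`. Thousands separators are stripped
 * into a stack buffer only when strchr actually finds one; otherwise the
 * original word is parsed directly with no copy.
 * Returns 0 on success, -1 on any error. */
static int parse_int64_sketch(const char *word, char tsep, int64_t *out) {
  char buf[64]; /* assumed capacity for the stripped copy */
  const char *p = word;

  if (tsep != '\0' && strchr(word, tsep) != NULL) {
    /* Slow path: copy the word without its separators. */
    size_t n = 0;
    for (const char *s = word; *s != '\0'; s++) {
      if (*s == tsep) {
        continue;
      }
      if (n + 1 >= sizeof buf) {
        return -1; /* no room left for this byte plus the terminator */
      }
      buf[n++] = *s;
    }
    buf[n] = '\0';
    p = buf;
  }

  /* Fast path arrives here with p still pointing at the original word. */
  char *endptr;
  errno = 0;
  long long value = strtoll(p, &endptr, 10);
  if (endptr == p || *endptr != '\0' || errno == ERANGE) {
    return -1;
  }
  *out = (int64_t)value;
  return 0;
}
```

The fast path saves a copy for words without separators, at the price of keeping two code paths. Always preprocessing, as the PR does, removes the branch and keeps a single path.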

WillAyd · 2025-10-30T19:51:56Z

pandas/_libs/src/parser/tokenizer.c

+#define SKIP_NSPAN(s, n, charset)                                              \
+  str_consume_nspan(NULL, 0, &(s), (n), (charset))
+
+#define SAFE_CONSUME_SPAN(d, de, s, charset)                                   \


This overall seems pretty complicated - what didn't work with strtok?

heoh · 2025-10-30

If _str_copy_decimal_str_c only needed to skip whitespace, I would have used strtok.
However, the function’s requirements are more complex:

it needs to skip certain character sets selectively,
handle different token types (tsep, decimal, digits), and
avoid modifying the source string (strtok overwrites each delimiter it finds with a null byte).

I initially experimented with using strtok just for whitespace skipping, but it only made the code more deeply nested and harder to follow.
Using it to simplify the more complex parsing turned out to be difficult, at least with my current approach.

Also, because strtok modifies the input string, I would have needed to make a copy first to avoid side effects.
(I’m not entirely sure how critical those side effects are in this context, but I chose to take the conservative route.)

For these reasons, I decided to use strspn and memcpy instead.
Up until commit 7b0be60, the code remained relatively simple, but after adding repetitive safety checks (such as buffer overflow prevention), it evolved into its current form.
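
The difference between the two standard-library approaches can be shown with a small sketch. Both helper names below are hypothetical and exist only for this illustration; strtok and strspn themselves are standard C.

```c
#include <stddef.h>
#include <string.h>

/* Returns a pointer just past the leading characters of `s` that belong to
 * `charset`. The input is only read, never written, so it can be const. */
static const char *skip_span_sketch(const char *s, const char *charset) {
  return s + strspn(s, charset);
}

/* Reports whether strtok altered `buf` while splitting on `delims`:
 * 1 if the bytes changed, 0 if not, -1 if `buf` is too long for the
 * local snapshot used in the comparison. */
static int strtok_mutates_sketch(char *buf, const char *delims) {
  char before[64];
  size_t len = strlen(buf);

  if (len >= sizeof before) {
    return -1;
  }
  memcpy(before, buf, len + 1);

  /* strtok writes '\0' over the first delimiter that ends a token. */
  (void)strtok(buf, delims);

  return memcmp(before, buf, len + 1) != 0;
}
```

For the writable array "12,345" split on ",", strtok leaves the buffer reading as "12", so the helper reports a mutation. skip_span_sketch applied to "  42" returns a pointer to "42" and leaves the source untouched, which is why it also works on string literals and other read-only input.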

heoh added 13 commits October 20, 2025 15:11

refactor(_str_copy_decimal_str_c): Use parsing functions from stdlib

7a2882b

refactor: extract str_consume_span() function

4a6b66b

refactor: Remove unnecessary conversions

d2ce67d

refactor: Simplify parsing logic by str_consume_span()

7b0be60

Prevent overflow for dst

4cfc5eb

refactor: Remove malloc in _str_copy_decimal_str_c and use caller buffer

4fbac54

refactor: Extract macro for readability

05504cf

refactor: Consolidate _str_copy_decimal_str_c and `copy_string_with…

9de4b2a

…out_char`

refactor: Use stack buffer in round_trip function

9ff0af1

Remove heap allocation (malloc/free) in round_trip by using stack-allocated buffer with PROCESSED_WORD_CAPACITY. This improves performance and simplifies memory management.

Fix implicit type casting warning

ea2c206

Merge branch 'main' into pandas-devgh-62717

ee92b7b

Parse exponential expressions only when decimal is present

b48f81b

Remove duplicate guard conditions

e7fcb65

heoh mentioned this pull request Oct 25, 2025

Consolidate decimal digit -> str functions in tokenizer.c #62717

Open

heoh added 2 commits October 28, 2025 08:58

Merge branch 'main' into pandas-devgh-62717

d32322f

Merge branch 'main' into pandas-devgh-62717

1d97b3a

mroeschke requested review from Alvaro-Kothe and WillAyd October 30, 2025 16:04

mroeschke added the Internals label (Related to non-user accessible pandas implementation) Oct 30, 2025

Alvaro-Kothe reviewed Oct 30, 2025

WillAyd reviewed Oct 30, 2025

4 participants
